

#### UNLEASH THE REVOLUTION IN NEXT-GEN COMPUTING

© 2024 ENFABRICA CORPORATION. ALL RIGHTS RESERVED.

# Foundational Networking Silicon for the Accelerated Computing Era



SHRIJEET MUKHERJEE SHRIJEET@ENFABRICA.NET


### :: the Supers : mainframe, ccNUMA







| Interconnect | Year | Bandwidth | Latency      | System             |
|--------------|------|-----------|--------------|--------------------|
| NUMAlink 5   | 2009 | 7.5 GB/s  | 500ns (max)  | Altix UV           |
| NUMAlink 8   | 2017 | 13.3 GB/s | 300ns (max)  | HPE Superdome Flex |

https://en.wikipedia.org/wiki/List\_of\_interface\_bit\_rates, https://www.researchgate.net/figure/Performance-of-random-ordered-ring-latency-for-Altix-POWER5-and-ICE-systems\_fig4\_220782298, https://www.cs.umd.edu/class/spring2017/cmsc714/Readings/SGI-UV-4192.pdf

### :: the Supers : mainframe, ccNUMA



https://en.wikipedia.org/wiki/List\_of\_interface\_bit\_rates, https://www.researchgate.net/figure/Performance-of-random-ordered-ring-latency-for-Altix-POWER5-and-ICE-systems\_fig4\_220782298, https://www.cs.umd.edu/class/spring2017/cmsc714/Readings/SGI-UV-4192.pdf, https://wiki.preterhuman.net/SGI\_Origin\_2000#/media/File:Sgi-origin-2000-dual-rack-uni-koln.jpg

#### **Optimal performance is at the "node" level**

- 2-4 processor sockets
- SMP shared memory
- NUMA was the answer to SMP scale issues

#### Cross node performance based on "network performance"

- Multiple paths
- Latencies in the sub-microsecond range
- All accesses are IPC in nature

#### Entire system memory is coherent

- Variable latency at high scales
- Needs torus- or dragonfly-style topologies
- Needs efficient page locality management

http://condor.cc.ku.edu/~grobe/docs/sgi-short-intro/index.shtml#SSMP

### :: borg : rise of scale out computing







In tightly coupled coherent systems, complexity scales exponentially

**Cross fabric scheduling complexity** 

#### **Application level tuning complexity**

- Static core pinning exposed system placement to the user
- OS-driven placement gives high system utilization but variable performance

#### The answer – Borg

| Transport       | Year | Bandwidth | Latency | Example NICs         |
|-----------------|------|-----------|---------|----------------------|
| Linux TCP stack | 2009 | 1.25 GB/s | 50ms    | UCS VIC, Intel ixgbe |
| RDMA on RoCEv2  | 2018 | 12.5 GB/s | 1-5us   | Mellanox, Intel      |

### :: borg : rise of scale out computing



All green links are RPC communication

#### Processing tasks assembled out of tasklets

- Tasklets are combined using RPCs to build processing pipelines
- Stages of a pipeline can live anywhere

#### Nodes are sized to house an entire tasklet (or multiple)

#### Tasklets can fail without affecting overall performance

- Redundancy and over-provisioning are built in

#### Programming model is framework-library based

- Enables building RPC boundaries at appropriate interfaces
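The tasklet model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each stage is a list of interchangeable replicas, a plain function call stands in for an RPC, and the names `TaskletFailure`, `call_with_failover`, `run_pipeline`, and the toy tasklets are hypothetical, not part of Borg or any real framework.

```python
from typing import Any, Callable, List

Tasklet = Callable[[Any], Any]


class TaskletFailure(Exception):
    """Raised when a tasklet replica cannot serve a request (hypothetical)."""


def call_with_failover(replicas: List[Tasklet], payload: Any) -> Any:
    """Try each replica of one stage in order and return the first success."""
    last_error = None
    for replica in replicas:
        try:
            return replica(payload)  # stands in for an RPC to some node
        except TaskletFailure as err:
            last_error = err  # this replica is down, try the next one
    raise TaskletFailure("all replicas of this stage failed") from last_error


def run_pipeline(stages: List[List[Tasklet]], payload: Any) -> Any:
    """Feed the output of each stage into the next stage."""
    for replicas in stages:
        payload = call_with_failover(replicas, payload)
    return payload


# Toy tasklets: one failed replica and one healthy replica for the first stage.
def broken_tokenize(text: str) -> List[str]:
    raise TaskletFailure("replica unavailable")


def tokenize(text: str) -> List[str]:
    return text.split()


def count_tokens(tokens: List[str]) -> int:
    return len(tokens)


stages = [[broken_tokenize, tokenize], [count_tokens]]
result = run_pipeline(stages, "stages of a pipeline can live anywhere")
print(result)
```

The caller never learns that the first replica failed, which is the point of building redundancy into the RPC boundary rather than into the application logic.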

### :: ML monsters : domain specific accelerators



#### Both models needed to evolve. A modern truly scalable solution demands

- The tight performance of a supercomputer, with thread scaling
- Resiliency of a cloud-scale system

#### Will adapt the programming model for performance

- Low level (like CUDA, the new assembly language)
- High level, structured kernels for computation and communication

### :: let's not "forget" memory



#### Supercomputers kept tight control over memory hierarchies

- Load / Store was always the preferred programmer's interface
- NUMA machines started stretching these limits

#### Distributed cloud systems allow a wide range of latencies and bandwidths on the network interconnects

- Packet based communication and RDMA are the programmer's interface
- This design is getting stretched by the rapid expansion of memory and coherent communication footprints

### Is there a best of both worlds?

### :: let's not "forget" memory



|                    | Latency   | Bandwidth / Channel      | Max Capacity* | Significance              | Programmer's View                                |
|--------------------|-----------|--------------------------|---------------|---------------------------|--------------------------------------------------|
| Reg                | 0.2ns     |                          | KB            |                           | L1 – dereference pointer                         |
| Cache              | 40ns      |                          | KB            |                           | L1 – dereference pointer                         |
| DDR (Main)         | 80-140ns  | 32-51.2 GB/s (DDR5)      | Up to 4TB     | In CPU                    | L2 – dereference pointer, high perf memcpy       |
| DDR (NUMA)         | 170-250ns | 32-51.2 GB/s (DDR5)      | Up to 8TB     |                           | L2 – dereference pointer, high perf memcpy       |
| DDR (CXL)          | 170-250ns | 32-51.2 GB/s (DDR5)      | 2-4 TB        | CPU independent but local | L3 – dereference pointer, high perf memcpy, swap |
| DDR (CXL Switched) | 300-400ns | 32-51.2 GB/s (DDR5)      | 64TB          | CPU independent but local | L3 – dereference pointer, high perf memcpy, swap |
| Far Memory         | 2-4us     | 100 GB/s (800G Ethernet) | infinite      | Network attached          | L4 – memcpy, swap                                |
| SSD                | 50-100us  |                          |               | Network attached          | L5 – memcpy, swap                                |
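One way to read the table above is as a simple cost model: moving a block of data costs roughly one access latency plus the block size divided by bandwidth. The sketch below applies that model to the tiers that list both numbers. It is an approximation under loud assumptions: the lower latency bound and upper bandwidth bound of each range are used, only a single channel or link is modelled, and the names `TIERS` and `transfer_time_ns` are hypothetical.

```python
# (latency in ns, bandwidth in GB/s per channel), taken from the table above.
# Assumption: lower latency bound and upper bandwidth bound of each range.
TIERS = {
    "DDR (Main)": (80.0, 51.2),
    "DDR (NUMA)": (170.0, 51.2),
    "DDR (CXL)": (170.0, 51.2),
    "DDR (CXL Switched)": (300.0, 51.2),
    "Far Memory": (2000.0, 100.0),
}


def transfer_time_ns(tier: str, size_bytes: int) -> float:
    """Estimate time to move size_bytes from one tier: latency + size / bandwidth."""
    latency_ns, gb_per_s = TIERS[tier]
    # 1 GB/s equals 1 byte per nanosecond, so bytes / (GB/s) is in nanoseconds.
    return latency_ns + size_bytes / gb_per_s


if __name__ == "__main__":
    for size in (64, 4096, 1_000_000):
        print(f"--- {size} bytes ---")
        for tier in TIERS:
            print(f"{tier:20s} {transfer_time_ns(tier, size):12.2f} ns")
```

Under this model a cache-line-sized access is dominated by latency, so far memory is much slower than any DDR tier, while for megabyte-sized copies the bandwidth term dominates. That matches the programmer's view column: pointer dereference for the near tiers, memcpy or swap for the far ones. Real systems aggregate many DDR channels, which this single-channel model ignores.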


### :: modern systems : compute accelerator

[Figure: system diagram with GPUs, CPUs, PCIe switches, a NIC, and DDR memory]

#### Interesting mix of

- Scale-up coherent domain
- Scale-out bulk data movement domain

#### Bandwidth on all interfaces is roughly balanced

#### Ratios of each of the communication types are hard-wired in the design

- CCLs (collective communication libraries) have evolved to try to match the system design
- Designed for max performance, cost is an afterthought







### :: accelerated compute fabric - superNIC







### :: what we are building

ultra-scalable networking silicon & software for high-performance / AI compute



### :: why do it this way?





#### It's a Collective 8X NIC for Collective GPUs

- 8X scale-out bandwidth of RDMA NICs
- Can load-distribute across GPUs

#### **It Cuts Down Network Latencies**

- 50-66% fewer device hops
- Better network-to-GPU traffic engineering
- Mitigates the incast problem

#### It Lowers AI Cluster TCO

- Allows GPUs to run hotter
- Disaggregates and elasticizes memory

#### Data movement can, for free

- Shuffle data
- Sparsify or Densify data

### :: why do it this way?



#### Binds compute-facing queues/ops with fabric-facing queues

- Queues can communicate in the scale-up domain coherently
- Queues can communicate using RDMA in the scale-out domain

#### Provides bandwidth matched to local buses

- Effectively acts like the page mover in a NUMA controller, using RDMA on the remote links
  - Standard Ethernet features like link aggregation and multipath (MP) routing provide link scaling

### :: scale and resiliency at scale : the game





#### **Multi-railed coherent interconnects**

- Enables bandwidth aggregation
- Coupled with sophisticated data movers enables routing around hotspots

#### Multi-railed Ethernet networks

- Enables per-packet routing
- Application-queue to network-queue mapping enables precise QoS and rate management

#### The combination

- Build wider networks with lower depth
  - Spread messages over multiple links
- Dramatically increase resiliency
  - Use network multi-pathing and failover mechanisms to not strand compute
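The idea of spreading messages over multiple links and failing over around a dead link can be illustrated with a small Python sketch. It is a toy model, not the ACF-S implementation: packets are sprayed round-robin over the healthy rails, each packet carries a sequence number so the receiver can restore order, and the names `spray`, `reassemble`, and the rail labels are hypothetical.

```python
from typing import Dict, FrozenSet, List, Sequence, Tuple

Tagged = Tuple[int, str]  # (sequence number, packet payload)


def spray(
    packets: Sequence[str],
    rails: Sequence[str],
    failed: FrozenSet[str] = frozenset(),
) -> Dict[str, List[Tagged]]:
    """Distribute packets round-robin over the rails that are still healthy."""
    healthy = [rail for rail in rails if rail not in failed]
    if not healthy:
        raise RuntimeError("no healthy rails left")
    queues: Dict[str, List[Tagged]] = {rail: [] for rail in healthy}
    for seq, packet in enumerate(packets):
        rail = healthy[seq % len(healthy)]
        queues[rail].append((seq, packet))  # tag with sequence number
    return queues


def reassemble(queues: Dict[str, List[Tagged]]) -> List[str]:
    """Merge the per-rail queues back into the original packet order."""
    tagged = [item for queue in queues.values() for item in queue]
    return [packet for _, packet in sorted(tagged, key=lambda item: item[0])]


rails = ["rail0", "rail1", "rail2", "rail3"]
packets = [f"pkt{i}" for i in range(8)]

# rail2 has failed: traffic is spread over the remaining three rails.
queues = spray(packets, rails, failed=frozenset({"rail2"}))
for rail, queue in queues.items():
    print(rail, [seq for seq, _ in queue])
print(reassemble(queues) == packets)
```

Losing one rail here reduces the available bandwidth but does not strand the sender, which is the resiliency argument the slide makes.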

### :: conventional nic vs. acf-s





| RDMA NIC / DPU                   | Solution Property     | ACF-S                                                      |
|----------------------------------|-----------------------|------------------------------------------------------------|
| 400 Gbps                         | Network Bandwidth     | 3.2 Tbps                                                   |
| 1-2 ports @ 400G<br>4-8 @ 100G   | Network Radix         | 4 ports @ 800G<br>8 ports @ 400G<br>32 ports @ 100G        |
| Not yet                          | Single-flow 800G      | Yes                                                        |
| Yes                              | GPU-direct RDMA       | Yes                                                        |
| No                               | PCIe link aggregation | Yes                                                        |
| No                               | Embedded CXL switch   | Yes                                                        |
| Fixed or Vendor-Configured RoCE  | AI Transport          | User SW-Defined RoCE, TCP, Spray support                   |
| Verbs                            | AI Transport API      | Verbs                                                      |
| DCQCN + PFC                      | Congestion avoidance  | DCQCN or Pacing + Fast Flow Control (FFC);<br>Packet Spray |
| Small                            | Incast Buffer         | Large, Shared                                              |
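As a quick sanity check on the bandwidth and radix rows, the short Python snippet below multiplies out each ACF-S port configuration from the table and compares the total with the 400 Gbps NIC figure. The variable names are illustrative only; the numbers are the ones listed in the table.

```python
NIC_GBPS = 400  # RDMA NIC / DPU network bandwidth from the table

# ACF-S radix options from the table: (number of ports, Gbps per port).
ACF_S_PORT_CONFIGS = [(4, 800), (8, 400), (32, 100)]


def aggregate_gbps(ports: int, gbps_per_port: int) -> int:
    """Total network bandwidth of one port configuration."""
    return ports * gbps_per_port


totals = [aggregate_gbps(ports, speed) for ports, speed in ACF_S_PORT_CONFIGS]
ratio = totals[0] / NIC_GBPS

for (ports, speed), total in zip(ACF_S_PORT_CONFIGS, totals):
    print(f"{ports} x {speed}G = {total / 1000} Tbps")
print(f"ACF-S vs. 400G NIC: {ratio}x")
```

Every configuration adds up to the same 3.2 Tbps, which is eight times the 400 Gbps of a single RDMA NIC and is consistent with the "8X" claim made earlier in the deck.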

### :: use cases : large training fabric





### :: ACF as a global memory component





### :: LLM inference at scale





| Example DLRM             | Required QPS | # Required GPUs | # Required CPUs | User Context Capacity |
|--------------------------|--------------|-----------------|-----------------|-----------------------|
| User Context in CPU DRAM | 1K           | 128             | 16              | 80K                   |
| User Context in ACF DRAM |              |                 |                 | 80K                   |

[Figure: Response Time Distribution]



GPU capacity needs to be severely overprovisioned to meet latency requirements

### :: use cases : in memory dbase



MemCache



### :: 8 terabit/sec acf-s pilot system for customer testing

#### **GPU networking node**

- 4 x 800G OSFP Ethernet
- 10 x16 PCIe cabled

#### **In-Network Memory Node**

- 4 x 800G OSFP Ethernet
- 8 18TB CXL DDR5

### In Manufacturing Now Orderable



#### 8 Tbps AI Networking Node

- Connect any combination of GPUs, CPUs, CXL memory, SSD to network
- Programmable Network Transport: RoCE, RDMA over TCP, UEC-direct
- Replaces NICs, PCIe switches, Ethernet TOR

#### 800G server I/O

- Composable, modular, production-grade

### :: rack-and-stack deployment





### :: learn more // engage with us





# Thank You.

### SHRIJEET MUKHERJEE SHRIJEET@ENFABRICA.NET